Programming Massively Parallel Processors: A Hands-on Approach: The CUDA Execution Model: Host vs. Device

The CUDA execution model turns your computer into a high-performance heterogeneous system. Picture a director (the host, or CPU) and an army of ten thousand (the device, or GPU). The director handles complex logic and decision-making, while the army carries out heavy, repetitive work in parallel.

1. Architectural Differences

The host is a latency-optimized CPU, designed for complex control logic and sequential tasks. The device, by contrast, is a throughput-optimized GPU made up of thousands of simple cores, designed to execute the same instruction simultaneously across large data sets.

2. Execution Flow

A CUDA program runs as a sequence of phases. Execution begins on the host, which runs the sequential code. When the program reaches a parallel kernel, the host launches a grid of threads onto the device. Once the device has finished the heavy work, control returns to the host.
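
The host-to-device-to-host sequence can be sketched as a small CUDA C++ program. This is a minimal illustration rather than code from the lesson: the kernel name `vecAddKernel`, the helper `vecAddOnDevice`, and the block size of 256 are assumptions chosen for the example, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Device code: each thread in the grid adds one pair of elements.
__global__ void vecAddKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Host code (hypothetical helper). Assumes a and b have the same length.
std::vector<float> vecAddOnDevice(const std::vector<float>& a,
                                  const std::vector<float>& b) {
    int n = static_cast<int>(a.size());
    std::vector<float> c(n);
    if (n == 0) return c;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    // Sequential host phase: allocate device memory and copy the inputs over.
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    // Parallel phase: the host launches a grid of threads onto the device.
    int threadsPerBlock = 256;  // assumed block size for this sketch
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAddKernel<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);

    // Back on the host: the launch itself is asynchronous, and this copy
    // waits for the kernel to finish before transferring the result.
    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}

int main() {
    std::vector<float> a{1.0f, 2.0f, 3.0f};
    std::vector<float> b{10.0f, 20.0f, 30.0f};
    std::vector<float> c = vecAddOnDevice(a, b);
    for (float x : c) std::printf("%g ", x);
    std::printf("\n");
    return 0;
}
```

Everything outside `vecAddKernel` runs on the host; only the body of the kernel runs on the device, once per thread.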

3. Performance Specialization

This model exploits the strengths of both processors: the CPU manages system resources and complex branching, while the GPU runs SPMD (Single Program, Multiple Data) logic to process data elements simultaneously.
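
To make the SPMD idea concrete, the sketch below has every thread run the same kernel while working on a different element, selected by its block and thread indices. The names `squareKernel` and `squareOnDevice` and the deliberately small block size of 4 are hypothetical choices for illustration only, and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Single program: every thread executes this same function.
// Multiple data: each thread derives its own index and touches one element.
__global__ void squareKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
    if (i < n) {                                    // surplus threads do nothing
        data[i] = data[i] * data[i];
    }
}

// Hypothetical host helper that squares every element on the device.
std::vector<float> squareOnDevice(std::vector<float> values) {
    int n = static_cast<int>(values.size());
    if (n == 0) return values;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float* dData = nullptr;
    cudaMalloc((void**)&dData, bytes);
    cudaMemcpy(dData, values.data(), bytes, cudaMemcpyHostToDevice);

    // With 4 threads per block, 10 elements need 3 blocks (12 threads);
    // the last 2 threads fail the bounds check and stay idle.
    int threadsPerBlock = 4;  // assumed, kept tiny so the mapping is easy to follow
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    squareKernel<<<blocks, threadsPerBlock>>>(dData, n);

    cudaMemcpy(values.data(), dData, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dData);
    return values;
}

int main() {
    std::vector<float> input{1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    std::vector<float> result = squareOnDevice(input);
    for (float x : result) std::printf("%g ", x);
    std::printf("\n");
    return 0;
}
```

The bounds check `i < n` matters because the grid is rounded up to a whole number of blocks, so it usually contains a few more threads than there are data elements.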

QUESTION 1

Which architecture is characterized as being 'throughput-optimized'?

The Host (Intel® CPU)

The Device (NVIDIA® GPU)

The System RAM

The PCIe Bus

QUESTION 2

The reader should complete Part 1 of the MatrixMultiplication() example in Figure 3.6 with declarations of the Nd and Pd pointer variables and their corresponding cudaMalloc() calls, and complete Part 3 with the matching deallocation calls. Which snippet does this correctly?

float *Nd, *Pd; cudaMalloc((void**)&Nd, size); ... cudaFree(Nd);

float Nd, Pd; malloc(&Nd, size); ... free(Nd);

float *Nd, *Pd; cudaMemcpy(Nd, Pd, size); ... delete Nd;

int Nd, Pd; Nd = new float[size]; ... free(Nd);

QUESTION 3

In the CUDA execution model, where does a program always begin its execution?

On the Device (GPU)

Simultaneously on both

On the Host (CPU)

In the Global Memory

QUESTION 4

What happens when the Host encounters a phase with rich data parallelism?

It speeds up its clock frequency.

It launches a Kernel onto the Device.

It stores the data in the Host Cache.

It converts the code to Python.

QUESTION 5

A student attempts to launch a 1024x1024 matrix multiplication on G80 hardware using 1024 blocks, where each thread calculates one element. Why will this fail?

The G80 cannot handle 1024 blocks.

The total number of threads exceeds 1 million.

The configuration results in 1024 threads per block, exceeding the 512 hardware limit.

Matrix multiplication is not data parallel.